William Langdon published in April 2014 in BioData Mining about mycoplasma reads in the 1000 Genomes sequencing data. He tested 2% of the total runs produced by the project (3982/187720) and found 7% of them (269/3982) to be contamintated with mycoplasma. A full description of the analysis can be found in his paper.
A full list of the runs Langdon tested and their contamination status can be found on our FTP site.
We recognise that there are mycoplasma sequence reads in some of the 1000 Genomes Project raw data sets but we do not believe that they have affected any of the published results and analyses, nor the use of data from the project by additional users.
The primary outcome from the 1000 Genomes Project is a collection of more than 35 million human genetic variants. These are obtained from reads that map to the human reference genome. As Langdon points out, the mycoplasma reads identified by Langdon do not map to the human reference genome, so they do not contaminate the results on human genetic variation. The 1000 Genomes Project makes its raw data sets available for reanalysis, and the complete read sets include mycoplasma reads, as well as reads from Epstein-Barr virus (EBV) and potentially from other non-human organisms that may have been present in the starting material. However the project also makes aligned data sets available and the vast majority of users only examine the reads aligned to the human reference, along with the inferred individual genome sequences derived from them. We make all the original raw data available as a matter of policy, both for transparency with respect to our data processing, and also to support those who would like to examine additional technical or biological phenomena that can be derived from the data.
Most of the DNA used for 1000 Genomes Project sequencing was obtained from immortalised cell lines and, although mycoplasmal infection of laboratory cell lines is undesirable, it is not a great surprise that some of these had mycoplasma infections, especially given that some of the cell lines and DNA were prepared a long time ago.
You can tell when a VCF file contains a phased genotype as the delimiter used in the GT field is a pipe symbol | e.g
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00096
10 60523 rs148087467 T G 100 PASS AC=0;AF=0.01;AFR_AF=0.06;AMR_AF=0.0028;AN=2; GT:GL 0|0:-0.19,-0.46,-2.28
The VCF files produced by the final phase of the 1000 Genomes Project (phase 3) are phased. They can be found in the final release directory from the project and in the directory supporting the final publications.
The majority of the VCF files in official releases over the life time of the project contained phased variants. This is also true for the pilot, phase 1 and final phase 3 data sets.
The phase 1 release files contain global R2 values but you can also use the VCF to plink converter if you wish to use our files with haploview or another similar tool.
The final data set produced by the 1000 Genomes Project was the phase 3 integrated data set. This contains fully phased haplotypes for 2,504 individuals. Full details can be found in the 1000 Genomes project phase 3 publication.
Our pilot data is all presented with respect to NCBI36 and our main project data is all presented with respect to GRCh37. If you need variant calls to be in a particular assembly it is best to go to dbSNP, Ensembl or an equivalent archive using their rs numbers as this will provide a definitive mapping.
If an rs number or equivalent is not available there are tools available to map between NCBI36, GRCh37 and GRCh38 from both Ensembl and the NCBI
The developers of Beagle, Mach and Impute2 have all created data sets based on the 1000 Genomes data to use for imputation.
Please look at the software’s website to find those files.
The 1000 Genomes Project shares some samples with the HapMap project; any sample which starts with NA was likely part of the HapMap project. In the pilot stages of the project HapMap genotypes were also used to help quality control the data and identify sample swaps and contamination. Since phase 1 the HapMap data has not been used by the 1000 Genomes Project, and all genotypes were independantly identified by 1000 Genomes.
The 1000 Genomes Project has used several different alignment algorithms during its duration:
Project stage | Sequencing technology | Alignment algorithm |
---|---|---|
Pilot | Illumina | MAQ |
Pilot | SOLiD | Corona lite |
Pilot | 454 | ssaha |
Main | Illumina | BWA |
Main | SOLiD | BFAST |
Main | 454 | ssaha (first set) |
Main | 454 | smalt (final set) |
The full process is described in the README
As part of our phase 1 analysis we performed functional annotation of our phase 1 variants with respect to both coding and non-coding annotation from GENCODE and the ENCODE project respectively.
This functional annotation can be found in our phase 1 analysis results directory. We present both the annotation we compared the variants to and VCF files which contain the functional consequences for each variant.
CRAM files are alignment files like BAM files. They represent a compressed version of the alignment. This compression is driven by the reference the sequence data is aligned to.
The file format was designed by the EBI to reduce the disk footprint of alignment data in these days of ever-increasing data volumes.
The CRAM files the 1000 genomes project distributes are lossy cram files which reduce the base quality scores using the Illumina 8-bin compression scheme as described in the lossy compression section on the cram usage page
There is a cram developers mailing list where the format is discussed and help can be found.
CRAM files can be read using many Picard tools and work is being done to ensure samtools can also read the file format natively.
The unmapped bams contain all the reads for the given individual which could not be placed on the reference genome. It contains no mapping information
Please note that any paired end sequence where one end successfully maps but the other does not both reads are found in the mapped bam
The Phase 1 integrated variant set does not report the depth of coverage for each individual at each site. We instead report genotype likelihoods and dosage. If you would like to see depth of coverage numbers you will need to calculate them directly.
The bedtools suite provides a method to do this.
genomeCoverageBed is a tool which can provide a bed file which specifies coverage for every base in the genome and intersectBed which will provide an intersection between two vcf/bed/bam files
These commands also require samtools, tabix and vcftools to be installed
An example set of commands would be
samtools view -b ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG01375/alignment/HG01375.mapped.ILLUMINA.bwa.CLM.low_coverage.20120522.bam 2:1,000,000-2,000,000 | genomeCoverageBed -ibam stdin -bg > coverage.bg
This command gives you a bedgraph file of the coverage of the HG01375 bam between 2:1,000,000-2,000,000
tabix -h http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/integrated_call_sets/ALL.chr2.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz 2:1,000,000-2,000,000 | vcf-subset -c HG01375 | bgzip -c > HG01375.vcf.gz
This command gives you the vcf file for 2:1,000,000-2,000,000 with just the genotypes for HG01375
To get the coverage for all those sites you would use
intersectBed -a HG01375.vcf.gz -b coverage.bg -wb > depth_numbers.vcf
You can find more information about bed file formats please see the Ensembl File Formats Help
For more information you may wish to look at our documentation about data slicing
The 1000 Genomes Project was divided into stages. Initially, a set of pilot projects were undertaken, followed by the main project, which was broken into three phases.
The initial part of the Project was called the pilot project. This was split into three pilot studies: the low coverage pilot (pilot 1), the high coverage pilot (pilot 2) and the exon targeted pilot (pilot 3). This data was completed in 2009 and published in Nature in 2010. All of the data associated with the pilot projects is available in the pilot_data directory on the FTP site.
Phase 1 represented low coverage and exome data analysis for the first 1092 samples. The phase 1 low coverage alignments and exome alignments are available in the phase 1 directory on the FTP site. Analysis of phase 1 was published in 2012. The analysis results associated with the paper can be found in the phase 1 analysis_results directory. The low coverage sequence data from phase 1 is listed in the 20101123 sequence index and the exome data in the 20110521 sequence index.
During phase 2 the set of samples expanded to around 1700 in number. The sequence data is represented in the 20111114 sequence index. This data was used for method development, to both improve on existing methods from phase 1 and also develop new methods to handle features like multi-allelic variant sites and true integration of complex variation and structural variants.
Phase 3 represents 2504 samples, including additional African samples and samples from South Asia. The methods developed in phase 2 were applied to this data set and a final catalogue of variation was released on the FTP site. These results were published in two publications in 2015, one covering the main project and the other focusing on structural variation.
The Genotype Dosage value comes from Mach/Thunder, imputation engine used for genotype refinement in the phase 1 data set.
The Dosage represents the predicted dosage of the non reference allele given the data available, it will always have a value between 0 and 2.
The formula is Dosage = Pr(Het|Data) + 2*Pr(Alt|Data)
The dosage value gives an indication of how well the genotype is supported by the imputation engine. The genotype likelihood gives an indication of how well the genotype is supported by the sequence data.
Our released genotypes are created not just using sequence evidence but also imputation. If an individual has no coverage at a particular location but overall we have been able to determine there is variation at that location then we can statistically infer the genotype for that variant in that individual using haplotype information. This means we are able to provide complete haplotypes for all the variation we discover. For more information about how the genotype calling was done please refer to our phase 1 publication.
HLA diversity is not something which was studied by the 1000 Genomes Project directly, however, groups have looked at the HLA diversity of the samples in the 1000 Genomes Project.
The most recent of these studies was published by Laurent Abi-Rached, Julien Paganini and colleagues in 2018 and covers 2,693 samples from the work of the 1000 Genomes Project. Details of the study and data used in this work are available via the publication and the HLA types are available on our FTP site at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HLA_types/.
The FTP site also hosts data from an earlier study by Pierre-Antoine Gourraud, Jorge Oksenberg and colleages at UCSF who carried out an HLA typing assay on DNA sourced from Coriell for 1000 Genomes samples. This earlier study looks at only the 1,267 samples that were available at that time.
The earlier work assessing HLA Diversity is publised in “HLA diversity in the 1000 Genome Dataset”, with data available from the 1000 Genomes FTP site in ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20140725_hla_genotypes/.
The ancestral alleles associated with the phase 1 release where generated using two different processes.
The SNP ancestral alleles were derived from Ensembl Compara release 59. The alignments used to generate them can be found in the phase1/supporting directory.
The indel ancestral alleles were generated using an separate process
The deletions should not have any ancestral alleles
Our main releases are contained in the release directory on our ftp site.
These directories are dated for the sequence index freeze the analysis products are based on.
Our most current recent release is the phase 3 release in 20130502. This represents an integrated set of SNPs, indels, MNPs, long insertions and deletions, copy number variations, and other types of structural variations discovered and genotyped in 2504 unrelated individuals.
The reference assembly the 1000 Genomes Project has mapped sequence data to has changed over the course of the project.
For the pilot phase we mapped data to NCBI36. A copy of our reference fasta file can be found on the ftp site.
For the phase 1 and phase 3 analysis we mapped to GRCh37. Our fasta file which can be found on our ftp site called human_g1k_v37.fasta.gz, it contains the autosomes, X, Y and MT but no haplotype sequence or EBV.
Our most recent alignment release was mapped to GRCh38, this also contained decoy sequence, alternative haplotypes and EBV. It was mapped using an alt aware version of BWA-mem. The fasta files can be found on our ftp site
As the majority of sites in the genome only has only been sequenced to low coverage, in all our individuals some sites genotypes will be based on imputation.
The process used to create our genotypes first gave our merged sites and genotype likelihoods sets to Beagle to generate initial haplotypes (using 50 interations across all samples) and these were refined using a modified version of Thunder (it used 300 states chosen by longest matching haplotype at each iteration in addition to 100 randomly chosen states).
This process means we are unable to precisely identify which sites used imputation to generate their genotype. Without this process the approximate error rate for our heterozygous sites would be 20% so you can estimate that 20% of our heterozgous sites will have been changed on the basis of imputation. The sites covered by our exome sequencing represent our highest accuracy sites and these are the least likely to have been changed by this process. The converse is also true any site without any sequence alignment will have been imputed. You can find the depth of coverage at any site using our bam files. Other sites may have been given greater evidence on the basis of the imputation and refinement process.
You can find out more about this in our Phase 1 paper.
In some early main project releases the allele frequency (AF) was estimated using additional information like LD, mapping quality and Haplotype information. This means in these releases the AF was not always the same as allele count/allele number (AC/AN). In the phase 1 release AF should always match AC/AN rounded to 2 decimal places.
The phase 1 variants list released in 2012 and the phase 3 variants list released in 2014 overlap but phase 3 is not a complete superset of phase 1. The variant positions between phase 3 and phase 1 releases were compared using their positions. This shows that 2.3M phase 1 sites are not present in phase 3. Of the 2.3M sites, 1.92M are SNPs, the rest are either indels or structural variations (SVs).
The difference between the two lists can be explained by a number of different reasons.
1. Some phase 1 samples were not used in phase 3 for various reasons. If a sample was not part of phase 3, variants private to this sample are not be part of the phase 3 set.
2. Our input sequence data is different. In phase 1 we had a mixture of both read lengths 36bp to >100bp and a mixture of sequencing platforms, Illumina, ABI SOLiD and 454. In phase 3 we only used data from the Illumina sequencing platform and we only used read lengths of 70bp+. We believe that these calls are higher quality, and that variants excluded this way were probably not real.
3. The first two reasons listed explain 548k missing SNPs, leaving 1.37M SNPs still to be explained.
The phase 1 and phase 3 variant calling pipelines are different. Phase 3 had an expanded set of variant callers, used haplotype aware variant callers and variant callers that used de novo assembly. It considered low coverage and exome sequence together rather than independently. Our genotype calling was also different using ShapeIt2 and MVNcall, allowing integration of multi allelic variants and complex events that weren’t possible in phase 1.
891k of the 1.37M sites missing from phase 1 were not identified by any phase 3 variant caller. These 891k SNPs have relatively high Ts/Tv ratio (1.84), which means these were likely missed in phase 3 because they are very rare, not because they are wrong; the increase in sample number in phase 3 made it harder to detect very rare events especially if the extra 1400 samples in phase 3 did not carry the alternative allele.
481k of these SNPs were initially called in phase 3. 340k of them failed our initial SVM filter so were not included in our final merged variant set. 57k overlapped with larger variant events so were not accurately called. 84k sites did not make it into our final set of genotypes due to losses in our pipeline. Some of these sites will be false positives but we have no strong evidence as to which of these sites are wrong and which were lost for other reasons.
4. The reference genomes used for our alignments are different. Phase 1 alignments were aligned to the standard GRCh37 primary reference including unplaced contigs. In phase 3 we added EBV and a decoy set to the reference to reduce mismapping. This will have reduced our false positive variant calling as it will have reduced mismapping leading to false SNP calls. We cannot quantify this effect.
We have made no attempt to eludcidate why our SV and indel numbers changed. Since the release of phase 1 data, the algorithms to detect and validate indels and SVs have improved dramatically. By and large, we assume the indels and SVs in phase 1 that are missing from phase 3 are false positive in phase 1.
You can get more details about our comparison from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/phase1_sites_missing_in_phase3/
The 1000 Genomes Project created what they defined as accessibilty masks for the pilot phase, phase 1 and phase 3 of the Project.
The pilot mask showed that only 85% of the genome is accessible to accurate analysis with the short read technology used by the 1000 Genomes pilot project. The remaining 15% is either repeats or segmental duplications. There is more information about the pilot mask in README.callability_masks.
For the phase 1 analysis using the pilot callability criteria 94% of the genome was accessible. A stricter mask was also created for the phase 1 project to be used for population genetics analysis; this mask used a narrower band of coverage criteria and also insisted that less than 0.1% of reads have a mapping quality of 0 and the average mapping quality should be 56 or higher. These criteria lead to 72.2% of the genome being accessible to accurate analysis with the short read technology used at that time by the 1000 Genomes Project. Further information is in section 10.4 of the supplementary material from the phase 1 publication.
In phase 3, using the pilot criteria 95.9% of the genome was found to be accessible. For the stricter mask created during phase 3, 76.9% was found to be accessible. A detailed description of the accessibility masks created during phase 3, the final phase of the Project, can be found in section 9.2 of the supplementary material for the main publication. The percentages quoted are for non-N bases.